****** CompressTools ******

A toolbox to help developers compress their data on small embedded systems.
(c) 2012-2015 Bregalad

IMPORTANT NOTE: This text file is best viewed with automatic line wrapping enabled, and with UTF-8 encoding.

**********************
** Archive contents **
**********************

/src   -------------------------------- The source code of compress tools
/script_examples ---------------------- Examples for input script / conversion script
compress_tools.exe -------------------- Win32 executable pre-compiled version of compress tools
CompressTools_2.0_java_legacy.7z ------ CompressTools 2.0 in Java (only as a fallback for people having trouble compiling 3.0)
readme.txt ---------------------------- This file

******************
** Introduction **
******************

I originally made this tool to help my NES game development, where I needed to store a lot of data on an embedded system with very little RAM and ROM available. However, the concept and usage of this tool generalize to any embedded system with low resources.

I was unhappy with the existing tools to compress and decompress files, for multiple reasons:

1) All of them take a binary file and compress it into another binary file. This is far too restrictive, and not suitable for many development applications!
   What if, for example, I want to use symbolic values, in order to be able to change them later? What if I want to encode non-ASCII text without typing everything manually in a hex editor?
   For this reason, CompressTools works from an input script file, and can output data in assembly, C or binary formats.

2) All of them compress the data into a single big compressed block.
   It is very common to have to compress a large set of data, but only use small blocks of it at once. For example, you want to compress all the levels of your game using the same algorithm, but you only need to load one screen at a time, and it would be far too slow and resource-wasteful to decompress ALL the levels only to get the single screen you need. However, compressing every screen separately is tedious, and gives very poor results, as compression usually only works well if the original data is big enough.
   If you want to compress text, it's the same: you only want a single sentence at a time, and don't want to decompress the whole script. This is called pseudo-random access - that is, I want to access the start of any block of data at any time (as opposed to true random access, which allows reading any byte of data at any time).
   Most tools don't allow simple pseudo-random access; CompressTools does.
   The notion of "data blocks" has been kept in mind from the very beginning of CompressTools' development. Anyone who doesn't need this is free to just put everything in a single data block.

3) The decompression algorithms of most existing tools are too complex, too slow and need too much RAM for the NES, which has only 2k of RAM and 32k of ROM.
   Even algorithms made for the Commodore 64, which also has an 8-bit 6502 processor, require large RAM buffers which are absent on the NES.
   
4) Most existing tools just apply a compression algorithm to the data, and don't do any analysis.
   CompressTools does a full analysis across all compression algorithms, to let the user see in a second which algorithms work best and which ones work worst. Some algorithms encode values on a variable number of bits; I even made it compress using ALL possible numbers of bits and automatically return the most efficient one. In other words, CompressTools doesn't just apply algorithms blindly, it really analyses which variant of an algorithm works best for YOUR data.

5) Sometimes your data has advantageous properties, for example when all values fit in a 0..2^n range with n < 8, so that not all bits are used.
   Most generic compression algorithms don't even check for such properties, and assume the data uses the full 8-bit range. In fact, data that uses all 256 symbol values is extremely rare!
   Even in the larger executable or graphics files I tested the encodings on, there were always some ranges of completely unused byte values! This fact is an absolute silver bullet when it comes to compression. Of course, CompressTools also comes with algorithms that still work in the worst case where all 256 symbols are used, without being restricted to it.

6) CompressTools can output an assembly or C-header style declaration of constants,
   so the resulting file is easily included in a project with a simple "include" directive. It can also output data as a single binary file or as multiple binary files (one file per block), depending on the user's taste.

7) CompressTools uses a plugin-like system which allows anyone to easily add their own algorithms to the list! Only a couple of modifications to the main program are needed to add your own algorithm. No advanced C++ knowledge is required, just the basics: the source code of the existing algorithms is distributed as an example, and is heavily commented so that anyone can understand it.

8) Since all the (included) decoding algorithms are relatively simple, it is not hard to implement the same functionality in assembly or any other language in an embedded project. Low-level decompressors were purposely not included in the package: the user is supposed to fully understand how the algorithm he's going to use works, and write his own implementation! However, C++ decoders and full documentation are provided.

9) For the most part, the implemented algorithms don't need any additional RAM (RAM-LZ77 is currently the only exception); at worst, only enough RAM to store the longest block is required.
   In some cases, it might even be possible to store nothing in RAM but a few pointers and counters, and to write a function that returns the next byte of the encoded block each time it is called. This is typically great for text compression, as you really only need one letter at a time. For level map compression, however, this is not an option.
  
10) Since version 1.3, CompressTools allows forcing one particular algorithm and selecting the output file name. This means it is possible to include CompressTools in your build chain: with a makefile, call CompressTools whenever the input script changes, and generate the necessary output file so that it can be further included in the project's sources.

11) Since version 2.0, CompressTools supports the full Unicode character set in its strings. This means complete support for the symbols and characters of all languages, as long as those "exotic" characters are mapped into the 8-bit range.
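For illustration, the pseudo-random access described in point 2 is typically achieved with a pointer table. The C++ sketch below shows the general idea only; the struct and its names are invented for the example and are NOT CompressTools' actual output layout.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch: all encoded blocks are concatenated into one
// ROM image, and a pointer table stores the start offset of each
// block. Reading block N never requires touching blocks 0..N-1.
struct BlockArchive {
    std::vector<uint8_t>  rom;       // concatenated encoded blocks
    std::vector<uint16_t> pointers;  // start offset of each block

    void add_block(const std::vector<uint8_t>& data) {
        pointers.push_back(static_cast<uint16_t>(rom.size()));
        rom.insert(rom.end(), data.begin(), data.end());
    }

    // Pseudo-random access: jump straight to the start of block n.
    const uint8_t* block_start(std::size_t n) const {
        return rom.data() + pointers[n];
    }
};
```

The -ptr and -shl flags described in the Usage section emit this kind of pointer table (whole pointers, or separated low/high bytes) at the end of the output file.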


***********
** Usage **
***********

CompressTools is a command line utility. Since version 3.0 there is no need for the Java runtime, and the rules for command line options have been relaxed: the order of the arguments does not matter at all, and if something is wrong with the arguments you'll get a (hopefully meaningful) error or warning.

Usage: compress_tools [flags] input_file

flags:
-mb :  Output to multiple binaries. Each encoded block is output to a binary
       file which is included in the assembly file, instead of being coded in the assembly file.
-sb :  Single binary. This encodes all blocks in a single binary file instead of
       having a different file for each block.
-c :   Output in C header file syntax instead of assembly syntax.
-ptr : Output a pointer table at the end of the assembly or header file.
-shl : Separate low and high bytes in the pointer table of the output assembly file
       (has no effect on a C-style pointer table).
-byte: The output assembly file is written using the .byte and .word keywords instead of .db and .dw
       respectively to define data.
-fa :  Force using one particular algorithm only. For example, -fa RLE forces using the RLE algorithm only,
       -fa BitpackRLE forces bitpack RLE, etc...
-o :   Set the output filename. Should be used together with -fa, otherwise each algorithm will
       overwrite the same output file.


*******************
** Input scripts **
*******************

Input scripts are the #1 reason CompressTools is better than the compression utilities released before it.
Scripts are very similar to assembly code, and are parsed by what one could call a mini-assembler.
However, I don't recommend naming them ".asm", in order not to confuse them with assembly output files. Call them ".txt", ".s" or whatever you prefer.
This assembler was made to allow the user to design the data in a flexible and friendly way, as if he were writing assembly code.

Input scripts contain three types of input: labels, directives and comments.

=== Comments ===

Commented parts are ignored by the mini-assembler.

It is possible to use comments in a couple of ways:
- C-style multi-line (or partial-line) comments /* */
- Single-line C++-style comments starting with //, and assembly-style comments starting with ;
- Any line starting directly with a * character is a comment

=== Directives ===

Directives always start with a dot. There are currently 7 directives implemented:

1) Define bytes directive: .db .dsb or .byte

Should be used to define bytes.
A small pre-processor allows simple integer calculations to be performed: plus +, minus -, multiply *, integer divide / and modulo % are implemented.
Values beginning with $ or 0x are in hexadecimal.
Values beginning with % or 0b are in binary.
Values surrounded by ' are chars. By default, the ASCII value of the char is used, but it is possible to map any char to any byte value, including all exotic (Unicode) chars. Some chars must be escaped with a \ symbol.
Values surrounded by " are strings, i.e. sequences of chars. Some chars within a string must be escaped with a \ symbol.

Characters that might need to be escaped are the double quote ", the escape symbol \ itself, and the characters that start comments, / and ;
It is possible to use any 8-bit value as an escape sequence by writing \x followed by 2 hex digits.
It is possible to use any Unicode character as an escape sequence by writing \u followed by 4 hex digits. If the input script is UTF-8 encoded, such characters can also be used directly.

Examples:
.db $3, $4, $5
.dsb 6, 7, 8
.byte "hello", 0
.db 'h', 'e', 'l', 'l', 'o'
.db (077 + $34)/(2 * 'a') * 0b1101     ;Yes, this line only defines a single byte
.db "This is a \"string\""             ;Encodes into : This is a "string"
.db '\\'                               ;Use the code for the \ character
.db '\xff'                             ;Same as .db 0xff or .db $ff
.db '\u1234'                           ;Use unicode character 0x1234


2) Define words directive: .dw .dsw or .word

This defines two bytes at a time, as a 16-bit little-endian value.
The only difference from .db is that chars and strings can't be used with .dw.

Examples:
.dw 15312
.word $abcd, $cafe

3) Define directive: .def or .define

This allows you to define a label with a value; all later instances of the label will be replaced by the value.

For example:
.define foo $4
.define bar foo + 5
.db bar

is the same as :
.db 9

Note also that labels and local labels are equivalent to defines, and can be used as well (only after they are defined, as this is a primitive single-pass assembler).

For example:

Label1
.db $4, $5, $6
Label2
.db Label2 - Label1     ;This will be equal to 3

4) Set origin directive: .org

The origin is used as an offset for labels, and is incremented for each byte.

For example:

.org $400
Label

Label will be equivalent to $400. By default, the offset starts at 0 at the beginning of the input file.

5) Include directive: .include

This simply includes script from another file. The code in the included file is "inlined" in the main file.

Example:
.include "my_file.txt"

6) Include binary directive: .incbin

This includes binary data from another file directly.

Example:
.incbin "my_file.bin"

7) Map characters directive: .map

This should be used to encode non-ASCII strings. It allows remapping any Unicode char to any 8-bit value. Escape sequences for some symbols are required, exactly like in .db.
Caution: in version 2.0 the syntax changed significantly from version 1.x, which makes old scripts incompatible. This is only for the best, though.

For example:
.map 'a' 80		; All instances of "a" will be mapped to decimal 80
.map 'b' 'a' + 1	; All instances of "b" will be mapped to decimal 81
.map '\"' $42		; All double-quote symbols will be mapped to the 8-bit hex value 0x42
.map '\u1234' 0x56	; All instances of the unicode character 0x1234 will be mapped to the 8-bit hex value 0x56

Any non-mapped symbol uses the lower 8 bits of its Unicode value by default.

Any character that is re-mapped overwrites the previous mapping without warning.

There are 6 shortcuts to map a series of characters with a single command:

1) Map all lower case chars :

.map locs $01    ;Will map 'a' to $01, 'b' to $02, ....., 'z' to $1a

2) Map all upper case chars :

.map upcs $21    ;Will map 'A' to $21, 'B' to $22, ...., 'Z' to $3a

3) Map all numbers :

.map num $40     ;Will map '0' to $40, '1' to $41, ....., '9' to $49

4) Map space :

.map sp $60      ;Will map space character to $60

5) Map hiragana :

.map hiragana $20   ; Will map all hiragana unicode characters $3041 to $3094 : 'ぁ' to $20, 'あ' to $21, ...., 'ゔ' to $73

6) Map katakana :

.map katakana $90   ; Will map all katakana Unicode characters $30a1 to $30f4 : 'ァ' to $90, 'ア' to $91, ..., 'ヴ' to $e3

It is possible to change the mapping of characters multiple times during a script, and to map multiple characters to a single value.

For example:
.map num 0
.map upcs 10
.map '0' 'O'

This makes all instances of the digit "0" be replaced by the letter "O".

=== Labels ===

Labels must start with a letter, and their body should be made of letters, numbers and the following symbols: _ . :
Local labels start with an underscore _
The only difference between labels and local labels is that labels are used to separate different blocks of data, while local labels have no effect on the data structure.

Example:

ImportantLabel
	.db 1, 2, 3, 4, 5, 6
_localLabel
	.db 1, 2, 3, 4, 5		;Belongs to the same data block as the line above

AnotherLabel
	.db 2, 3, 4, 5, 6		;Belongs to a new data block

When two consecutive labels follow each other, a "dummy" (empty) data block is created. This has no negative consequences; it will be ignored by the compression algorithms.

For example:

Label1
Label2		;Label1 defines an empty data block
	.db 2, 3, 4

Labels and local labels correspond to an integer value (in a virtual address space). They can be used to compute the length of some data automatically. For example:

Label1
	.incbin "binary_file.bin"
_end_of_binary_file
	.dw _end_of_binary_file - Label1	;Computes the size of the binary file

The character * can be used in an expression as an alias for the current virtual address, without using a label.
If there is any risk of confusion with the * (multiply) operator, I suggest using a pair of parentheses.

Example:
Label1
	.incbin "binary_file.bin"
	.dw *-Label1    ;Same as above example
	.dw (*)*2       ;Returns 2 times the current address

*******************************************
* Convert from plaintext to input scripts *
*******************************************

Since version 2.0 of CompressTools, a Perl script is provided to convert plaintext into a usable input script. An example batch script is provided that compresses this very file you're reading right now.

This Perl script is provided as a simple example, and the user is supposed to adapt it to his needs. In this case, I decided that any line return marks a separation between blocks, that empty lines are ignored, and that all lines are terminated with a NULL (0x00) character. Of course, this can be adapted depending on how you want the script to be stored before compression (and thus how you want it restored after decompression).

***********************
* Encoding algorithms *
***********************

All encoding algorithms and sub-algorithms are fully commented in the source code, and it is necessary to have a look at least at the algorithm you're going
to use in your project.
I assume the reader is already familiar with the basics of compression algorithms, such as RLE, Huffman and the LZ family; I don't feel like explaining the basics
here. If you need to learn them, I suggest reading Jay's excellent document, which is available on www.romhacking.net

Summary of the included compression algorithms:

=== RLE and variants ===

I probably don't need to introduce Run Length Encoding (RLE). It is the simplest compression technique in existence. Unfortunately, it only works well for certain types of data; for example, RLE is completely useless when it comes to compressing text. However, it is extremely simple to implement in assembly on a low resource machine, and can sometimes compress levels and some graphics decently well.

3 variants were implemented: Standard, BitPack and Escape.
BitPack only works on data that doesn't use all 8 bits of each byte, but will never expand the data. Standard and Escape work on any data, but there is a risk that the data is expanded.
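For illustration, the Standard variant boils down to (count, value) pairs. This is only a rough sketch of the general idea (my own minimal layout, not the tool's exact byte format); note how a run of length 1 costs 2 bytes, which is exactly how RLE can expand unfavourable data.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal sketch of "Standard" RLE: each run of identical bytes is
// stored as a (count, value) pair, with runs capped at 255 bytes.
std::vector<uint8_t> rle_encode(const std::vector<uint8_t>& in) {
    std::vector<uint8_t> out;
    for (std::size_t i = 0; i < in.size(); ) {
        std::size_t run = 1;
        while (i + run < in.size() && in[i + run] == in[i] && run < 255)
            ++run;
        out.push_back(static_cast<uint8_t>(run));
        out.push_back(in[i]);
        i += run;
    }
    return out;
}

std::vector<uint8_t> rle_decode(const std::vector<uint8_t>& in) {
    std::vector<uint8_t> out;
    for (std::size_t i = 0; i + 1 < in.size(); i += 2)
        out.insert(out.end(), (std::size_t)in[i], in[i + 1]); // expand run
    return out;
}
```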

=== RLEINC1 and 2 ===

Those codecs were made by Joel Yliluoma specifically to compress tile map data, but they can extend to any other data. The idea is to expand RLE by encoding not only runs of identical bytes, but also runs of increasing values and, in the case of RLEINC2, runs of repeating byte-pair patterns.

=== Byte Pair Encoding ===

This is the famous compression that many Japanese RPGs use for their text, but it can apply to almost any data. Almost, because there is a requirement that a range of values be unused in the source data. However, surprisingly few data sources fail to fulfill this condition, intentionally or not.

There are two variants: regular and recursive. Recursive encoding allows a byte within a pair to point to another byte pair. For this reason,
the recursive variant systematically compresses at least as well as the regular version.

This encoding can't expand the data.
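As a rough sketch of the decoding side (my own minimal layout, not the tool's exact table format), unused byte values stand for pairs, and a tiny stack is enough to expand them. The same loop naturally handles the recursive variant, where a pair member is itself a pair code:

```cpp
#include <cstdint>
#include <vector>

struct Pair { uint8_t first, second; };

// Sketch of byte pair decoding: byte values in [base, base+count)
// are unused in the source data and each stands for a pair of
// bytes. A small stack expands them, recursively if needed.
std::vector<uint8_t> bpe_decode(const std::vector<uint8_t>& in,
                                const Pair* pairs,  // indexed by code - base
                                uint8_t base,       // first pair code
                                uint8_t count) {    // number of pair codes
    std::vector<uint8_t> out, stack;
    for (uint8_t b : in) {
        stack.push_back(b);
        while (!stack.empty()) {
            uint8_t c = stack.back();
            stack.pop_back();
            if (c >= base && c < base + count) {
                // Push the second byte first, so the first byte
                // of the pair is expanded (and output) first.
                stack.push_back(pairs[c - base].second);
                stack.push_back(pairs[c - base].first);
            } else {
                out.push_back(c);   // plain literal byte
            }
        }
    }
    return out;
}
```

The work area is just a small stack, which is part of what makes this family attractive on a 2k-RAM machine.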

=== Huffman and variants ===

There are 3 variants: the regular canonical Huffman algorithm, and two simplifications of it called TinyHuff and TinyHuffFixed.
Both simplifications are faster and simpler to decode, but typically don't compress as well.

In all cases, the encodings apply to any data, and there is a risk of seeing the data expanded.

=== LZ77 variants ===

Although dozens of variants are theoretically possible, only 3 of them have been implemented so far. But there is a good reason for that.

The regular "RAM" LZ77 is the real algorithm as it was meant to work. Unfortunately, this algorithm doesn't enable pseudo-random access to a block of data, and requires a lot of RAM, which makes it unsuitable for (typical) low-RAM embedded applications. An implementation was mainly made to see how well it performs compared to the other implemented algorithms.

Two "ROM" variants of LZ77 were also coded, which remove the need for a large RAM buffer, but typically don't compress as well. The 7-bit variant only applies to data where only 7 bits are used: this variant has no risk of expanding the data, but the 8-bit variant does.

Caution: This category of algorithms can be slow for very large data sets.
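To illustrate the family in general terms, here is a classic LZ77 decoder where back-references copy from the already-decoded output. The token format is invented for this example (it is not one of the three implemented variants), and keeping the whole decoded output in RAM is precisely what the "ROM" variants avoid:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of generic "RAM" LZ77 decoding. Assumed token format:
// a 0x00 flag byte followed by one raw byte (literal), or a
// (length, distance) pair with length >= 1, copying `length`
// bytes from `distance` bytes back in the decoded output.
std::vector<uint8_t> lz_decode(const std::vector<uint8_t>& in) {
    std::vector<uint8_t> out;
    std::size_t i = 0;
    while (i < in.size()) {
        if (in[i] == 0x00) {                // literal token
            out.push_back(in[i + 1]);
        } else {                            // match token
            uint8_t len = in[i], dist = in[i + 1];
            for (uint8_t k = 0; k < len; ++k)
                out.push_back(out[out.size() - dist]); // overlap-safe copy
        }
        i += 2;
    }
    return out;
}
```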

=== Static Dictionary ===

This algorithm expands the idea of Byte Pair Encoding to arbitrary-length "words". Unused byte values are used to store frequent "words" of arbitrary length. In the recursive version, words themselves can include references to other defined "words"; in the regular version, they never do.

Caution: This category of algorithms can be extremely slow for very large data sets. I made a system which displays the progress during compression; this is the reason I made it the last of the list :) If it is too slow, the user can force the program to terminate with Ctrl+C, and nothing will be lost (except, obviously, that this encoding won't be tested).

It is possible to affect the speed by modifying a few things in the source code (please check the comments for more info). If anyone has an idea how to make the encoding faster while still being exact/accurate, please contact me.

******************************
* Adding your own algorithms *
******************************

It is also possible for the user to design his own algorithms and add them to the list. The procedure is really simple. Of course, it requires knowing how to program in C++, but the basics of the language should be enough (I'm no C++ expert myself, trust me! I only used C++ because it makes it easy to deal with strings, arrays and the like). The original version was written in Java, but it was migrated to C++ because this makes the program easier to compile, requires no extra downloads other than a C++ compiler, and more people are likely to contribute.

Just add a module that includes the "encoding.hpp" header, modify the makefile and main.cpp accordingly, and your own algorithm will be integrated into CompressTools! Don't hesitate to look at the existing algorithms as (hopefully good) examples.

The requirements are the following:
- Each algorithm should live in its own namespace and include the "encoding.hpp" header, implementing the encode(), decode(), get_labels() and get_defines() functions.
- It is possible (and often necessary) to add data blocks as a header or footer, which serve as additional information for the decompressor.
  This is where the get_labels() and get_defines() functions come into play. Just look at existing encodings for examples of how to handle this.
- It is possible that the encoder can't work with arbitrary input, and makes assumptions about the data itself. In that case, it is its role to reject the data when necessary, and to
  return an empty "vector" when encode() is called on inadequate data.
- If the encoder needs statistics about the data, it can get them from the global variables declared in the "main.cpp" file, imported as "extern".
- encode() can be a very long and complex algorithm; this is no problem. It can encode data blocks individually, or all at the same time, it doesn't matter.
- decode(), however, should be simple in concept. It should be portable to embedded assembly code, use little memory, and be reasonably fast.
  Normally it should decode blocks individually (i.e. decoding block N doesn't require decoding blocks 0...N-1), but RAM LZ77 is an exception to this.
- This is not required, but if someone ever implements any good algorithm, I'd sure like to hear about it and to add it to the official release.
  You will of course be credited.
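As a hypothetical example of what such a module could look like (the real function signatures are dictated by "encoding.hpp"; everything below, including the signatures, is assumed for illustration only):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical skeleton of a new algorithm module. In the real
// project this file would #include "encoding.hpp" and be added
// to the makefile and main.cpp.
namespace my_algorithm {

using Block = std::vector<uint8_t>;

// Return an empty vector to reject data the algorithm can't handle.
std::vector<Block> encode(const std::vector<Block>& blocks) {
    std::vector<Block> out;
    for (const Block& b : blocks) {
        if (b.size() > 0xffff) return {};   // reject oversized input
        out.push_back(b);                   // identity "compression"
    }
    return out;
}

// Must be the inverse of encode(), and simple enough to port to
// embedded assembly.
Block decode(const Block& encoded) {
    return encoded;
}

// Extra labels/constants the decompressor needs, emitted into the
// output assembly file or C header.
std::vector<std::string> get_labels()  { return {}; }
std::vector<std::string> get_defines() { return {"MYALG_VERSION = 1"}; }

} // namespace my_algorithm
```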


***** Special thanks *****
- Jay for his excellent document about compression, and his example code (although in the end I didn't base my code on his; I preferred to write everything from scratch).
- David A. Huffman for his nice algorithm
- Abraham Lempel & Jacob Ziv for their nice algorithm
- Joel Yliluoma for his RLEINC algorithms
- Whoever invented any other compression method I forgot to mention here.

*** Contact ***

To contact me please e-mail me at jmasur at bluewin dot ch.

*** Version history ***

v1.0 18 August 2012
Initial release

v1.1 11 December 2012
Made some improvements based on thefox's suggestions.
More specifically: the -fa, -o and -byte options were added, and colons ':' are added after labels in output files.

v1.2 26 February 2013
-fa now specifies the algorithm's full name (instead of a number)
-o can now specify a full name with any chosen extension (no .asm or .h forced)
Colons are no longer added at the end of labels which already have them
Fixed a bug where using .incbin without a label would cause exceptions instead of printing an understandable error message

v1.3 24 March 2013
Added RLEINC1 and RLEINC2 compressions by Joel Yliluoma

v2.0 21 July 2013
Finally, version 2.0 is here! A large number of improvements were made:
- Strings and chars can use all unicode chars, and they can be mapped with bytes.
- Escape sequences are now implemented within strings.
- The StaticDictionary algorithm is much improved; it is still the slowest, especially for large data where it can take a dozen minutes to compress, however it is now "usable", as opposed to its previous version.
- The @ symbol no longer starts comments.
- A tool to convert from plaintext to input scripts has been made.

v3.0 2 February 2015
Complete rewrite of the program in C++; it no longer requires the JRE to run
- Fixed a bug in all three LZ77-family algorithms, where the NBITS constant would be wrong in the output file
- Fixed a bug where the non-recursive static dictionary algorithm was actually partially recursive
- Improved the main program a lot (better argument parsing, better input file parsing)
- Improved a little the documentation
